Retrieve, Caption, Generate: Visual Grounding for Enhancing Commonsense in Text Generation Models

Authors

Abstract

We investigate the use of multimodal information contained in images as an effective method for enhancing the commonsense of Transformer models for text generation. We perform experiments using BART and T5 on concept-to-text generation, specifically the task of generative commonsense reasoning, or CommonGen. We call our approach VisCTG: Visually Grounded Concept-to-Text Generation. VisCTG involves captioning images representing appropriate everyday scenarios, and using these captions to enrich and steer the generation process. Comprehensive evaluation and analysis demonstrate that VisCTG noticeably improves model performance while successfully addressing several issues of the baseline generations, including poor commonsense, fluency, and specificity.
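The abstract describes a retrieve-caption-generate pipeline: images matching the input concepts are retrieved, captioned, and the captions are used to augment the seq2seq input. As a rough illustration, the input-assembly step might look like the sketch below; the exact separator, ordering, and number of captions used by VisCTG are not given in the abstract, so `build_visctg_input` and its format are assumptions, not the paper's actual implementation.

```python
def build_visctg_input(concepts, captions, top_k=1):
    """Assemble a caption-augmented input for a seq2seq model
    (e.g. BART or T5) on a CommonGen-style concept set.

    concepts: list of concept words, e.g. ["dog", "frisbee", "catch"]
    captions: captions of retrieved images, ranked by relevance
    top_k:    how many captions to prepend (hypothetical knob)
    """
    # Prepend the top-k retrieved captions to the concept string so the
    # model can ground its generation in a concrete everyday scene.
    selected = captions[:top_k]
    concept_str = " ".join(concepts)
    return " ".join(selected + [concept_str])


augmented = build_visctg_input(
    ["dog", "frisbee", "catch"],
    ["a dog catches a frisbee in a park."],
)
```

The augmented string would then be fed to the encoder in place of the bare concept set, steering decoding toward a plausible scene.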


Related Papers

Learning a Recurrent Visual Representation for Image Caption Generation

In this paper we explore the bi-directional mapping between images and their sentence-based descriptions. We propose learning this mapping using a recurrent neural network. Unlike previous approaches that map both sentences and images to a common embedding, we enable the generation of novel sentences given an image. Using the same model, we can also reconstruct the visual features associated wi...


Image Caption Generation with Text-Conditional Semantic Attention

Attention mechanisms have attracted considerable interest in image captioning due to their powerful performance. However, existing methods use only visual content as attention, and whether textual context can improve attention in image captioning remains unsolved. To explore this problem, we propose a novel attention mechanism, called text-conditional attention, which allows the caption generator t...


Grounding commonsense knowledge in intelligent systems

Ambient environments which integrate a number of sensing devices and actuators intended for use by human users need to be able to express knowledge about objects, their functions and their properties to assist in the performance of everyday tasks. For this to occur, perceptual data must be grounded to symbolic information that in its turn can be used in the communication with the human. For symb...


Review Networks for Caption Generation

We propose a novel extension of the encoder-decoder framework, called a review network. The review network is generic and can enhance any existing encoder-decoder model: in this paper, we consider RNN decoders with both CNN and RNN encoders. The review network performs a number of review steps with an attention mechanism on the encoder hidden states, and outputs a thought vector after each review s...


Generation and grounding of natural language descriptions for visual data

Generating natural language descriptions for visual data links computer vision and computational linguistics. Being able to generate a concise and human-readable description of a video is a step towards visual understanding. At the same time, grounding natural language in visual data provides disambiguation for the linguistic concepts, necessary for many applications. This thesis focuses on bot...



Journal

Journal Title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2022

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v36i10.21306